Takehome_ex01

1 Introduction

1.1 Background

The housing market in Singapore comprises both public and private sectors. Public housing serves individuals with monthly household incomes up to S$14,000, while those with higher incomes typically turn to the private residential market.

1.2 Project Objectives

In this exercise, Exploratory Data Analysis (EDA) methods and ggplot functions are used to explore:

  • the distribution of

  • the relationship

2.Getting started

2.1 Importing libraries

We load the following R packages using the pacman::p_load() function:

#|eval=FALSE
pacman::p_load(tidyverse, ggstatsplot, ggridges, plotly, dplyr,wesanderson,lubridate)

2.2 loading data

The dataset used in the exercise is transaction data of REALIS

We import this dataset by using read.csv() function

files1<-read.csv("data/ResidentialTransaction20240308160536.csv")
files2<-read.csv("data/ResidentialTransaction20240308160736.csv")
files3<-read.csv("data/ResidentialTransaction20240308161009.csv")
files4<-read.csv("data/ResidentialTransaction20240308161109.csv")
files5<-read.csv("data/ResidentialTransaction20240414220633.csv")

2.3 Glimpse of data

Using the glimpse() function, we see the first 5 rows of five datasets (files1 to files5). These datasets likely contain private residential property transaction data, possibly segmented by different criteria such as date, type, and price. Further exploration is needed to understand the meaning of the fields, data types, presence of missing values, and how to integrate these datasets for analysis and visualization.

Show the code
head(files1,5)
             Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.
1 THE REEF AT KING'S DOCK            2,317,000      882.65              2,625
2         URBAN TREASURES            1,823,500      882.65              2,066
3              NORTH GAIA            1,421,112     1,076.4              1,320
4              NORTH GAIA            1,258,112    1,033.34              1,218
5           PARC BOTANNIA            1,280,000      871.88              1,468
    Sale.Date                       Address Type.of.Sale Type.of.Area
1 01 Jan 2023 12 HARBOURFRONT AVENUE #05-32     New Sale       Strata
2 02 Jan 2023        205 JALAN EUNOS #08-02     New Sale       Strata
3 02 Jan 2023        29 YISHUN CLOSE #08-10     New Sale       Strata
4 02 Jan 2023        45 YISHUN CLOSE #07-42     New Sale       Strata
5 03 Jan 2023     12 FERNVALE STREET #06-16       Resale       Strata
  Area..SQM. Unit.Price....PSM. Nett.Price...         Property.Type
1         82             28,256             -           Condominium
2         82             22,238             -           Condominium
3        100             14,211             - Executive Condominium
4         96             13,105             - Executive Condominium
5         81             15,802             -           Condominium
  Number.of.Units                 Tenure Completion.Date
1               1 99 yrs from 12/01/2021     Uncompleted
2               1               Freehold     Uncompleted
3               1 99 yrs from 15/02/2021     Uncompleted
4               1 99 yrs from 15/02/2021     Uncompleted
5               1 99 yrs from 28/12/2016            2022
  Purchaser.Address.Indicator Postal.Code Postal.District Postal.Sector
1                         HDB       97996               4             9
2                     Private      419535              14            41
3                         HDB      269343              27            26
4                         HDB      269294              27            26
5                         HDB      797391              28            79
    Planning.Region Planning.Area
1    Central Region   Bukit Merah
2       East Region         Bedok
3      North Region        Yishun
4      North Region        Yishun
5 North East Region      Sengkang
Show the code
head(files2,5)
     Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.
1     THE GAZANIA            1,528,000      678.13              2,253
2     THE GAZANIA            1,938,000         958              2,023
3  ONE PEARL BANK            2,051,000      699.66              2,931
4 URBAN TREASURES            1,850,700      882.65              2,097
5 HYLL ON HOLLAND            2,021,500      699.66              2,889
    Sale.Date                 Address Type.of.Sale Type.of.Area Area..SQM.
1 01 Apr 2023 15 HOW SUN DRIVE #02-31     New Sale       Strata         63
2 01 Apr 2023  7 HOW SUN DRIVE #01-12     New Sale       Strata         89
3 01 Apr 2023     1 PEARL BANK #32-16     New Sale       Strata         65
4 01 Apr 2023  205 JALAN EUNOS #05-05     New Sale       Strata         82
5 01 Apr 2023  97 HOLLAND ROAD #10-25     New Sale       Strata         65
  Unit.Price....PSM. Nett.Price... Property.Type Number.of.Units
1             24,254             -   Condominium               1
2             21,775             -   Condominium               1
3             31,554             -     Apartment               1
4             22,570             -   Condominium               1
5             31,100             -   Condominium               1
                  Tenure Completion.Date Purchaser.Address.Indicator
1               Freehold            2022                         N.A
2               Freehold            2022                     Private
3 99 yrs from 01/03/2019     Uncompleted                     Private
4               Freehold     Uncompleted                         HDB
5               Freehold     Uncompleted                     Private
  Postal.Code Postal.District Postal.Sector   Planning.Region Planning.Area
1      538545              19            53 North East Region     Serangoon
2      538530              19            53 North East Region     Serangoon
3      169016               3            16    Central Region        Outram
4      419535              14            41       East Region         Bedok
5      278541              10            27    Central Region   Bukit Timah
Show the code
head(files3,5)
  Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.   Sale.Date
1         MYRA            1,658,000      667.37              2,484 01 Jul 2023
2   NORTH GAIA            1,449,000     1,076.4              1,346 01 Jul 2023
3   NORTH GAIA            1,365,000     1,076.4              1,268 01 Jul 2023
4   NORTH GAIA            1,231,000         958              1,285 01 Jul 2023
5   NORTH GAIA            1,272,000    1,001.05              1,271 01 Jul 2023
                         Address Type.of.Sale Type.of.Area Area..SQM.
1 9 MEYAPPA CHETTIAR ROAD #02-07     New Sale       Strata         62
2         27 YISHUN CLOSE #10-06     New Sale       Strata        100
3         27 YISHUN CLOSE #05-06     New Sale       Strata        100
4         35 YISHUN CLOSE #08-25     New Sale       Strata         89
5         45 YISHUN CLOSE #09-45     New Sale       Strata         93
  Unit.Price....PSM. Nett.Price...         Property.Type Number.of.Units
1             26,742             -             Apartment               1
2             14,490             - Executive Condominium               1
3             13,650             - Executive Condominium               1
4             13,831             - Executive Condominium               1
5             13,677             - Executive Condominium               1
                  Tenure Completion.Date Purchaser.Address.Indicator
1               Freehold     Uncompleted                         N.A
2 99 yrs from 15/02/2021     Uncompleted                         HDB
3 99 yrs from 15/02/2021     Uncompleted                         HDB
4 99 yrs from 15/02/2021     Uncompleted                         HDB
5 99 yrs from 15/02/2021     Uncompleted                         HDB
  Postal.Code Postal.District Postal.Sector Planning.Region Planning.Area
1      358456              13            35  Central Region     Toa Payoh
2      769342              27            76    North Region        Yishun
3      769342              27            76    North Region        Yishun
4      769299              27            76    North Region        Yishun
5      769294              27            76    North Region        Yishun
Show the code
head(files4,5)
   Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.   Sale.Date
1  LEEDON GREEN            1,749,000       538.2              3,250 01 Oct 2023
2      LIV @ MB            3,148,740    1,453.14              2,167 01 Oct 2023
3          MORI            2,422,337    1,259.39              1,923 01 Oct 2023
4     THE ARDEN            1,330,000      721.19              1,844 01 Oct 2023
5 LENTOR MODERN            2,237,000    1,130.22              1,979 01 Oct 2023
                     Address Type.of.Sale Type.of.Area Area..SQM.
1   26 LEEDON HEIGHTS #11-08     New Sale       Strata         50
2    114A ARTHUR ROAD #01-01     New Sale       Strata        135
3 223 GUILLEMARD ROAD #05-21     New Sale       Strata        117
4      6 PHOENIX ROAD #01-18     New Sale       Strata         67
5    3 LENTOR CENTRAL #05-03     New Sale       Strata        105
  Unit.Price....PSM. Nett.Price... Property.Type Number.of.Units
1             34,980             -   Condominium               1
2             23,324             -   Condominium               1
3             20,704             -     Apartment               1
4             19,851             -     Apartment               1
5             21,305             -     Apartment               1
                  Tenure Completion.Date Purchaser.Address.Indicator
1               Freehold     Uncompleted                     Private
2 99 yrs from 23/11/2021     Uncompleted                     Private
3               Freehold     Uncompleted                     Private
4 99 yrs from 14/07/2023     Uncompleted                     Private
5 99 yrs from 26/10/2021     Uncompleted                     Private
  Postal.Code Postal.District Postal.Sector   Planning.Region Planning.Area
1      266221              10            26    Central Region   Bukit Timah
2      439826              15            43    Central Region Marine Parade
3      399738              14            39    Central Region       Geylang
4      668159              23            66       West Region   Bukit Batok
5      788888              26            78 North East Region    Ang Mo Kio
Show the code
head(files5,5)
       Project.Name Transacted.Price.... Area..SQFT. Unit.Price....PSF.
1      THE LANDMARK            2,726,888     1,076.4              2,533
2 POLLEN COLLECTION            3,850,000    1,808.35              2,129
3    SKY EDEN@BEDOK            2,346,000    1,087.16              2,158
4        TERRA HILL            2,190,000       807.3              2,713
5     PINETREE HILL            1,954,000      796.54              2,453
    Sale.Date                   Address Type.of.Sale Type.of.Area Area..SQM.
1 01 Jan 2024 173 CHIN SWEE ROAD #22-11     New Sale       Strata        100
2 01 Jan 2024           34 POLLEN PLACE     New Sale         Land        168
3 01 Jan 2024    1 BEDOK CENTRAL #09-10     New Sale       Strata        101
4 01 Jan 2024 18A YEW SIANG ROAD #03-12     New Sale       Strata         75
5 01 Jan 2024      36 PINE GROVE #05-18     New Sale       Strata         74
  Unit.Price....PSM. Nett.Price... Property.Type Number.of.Units
1             27,269             -   Condominium               1
2             22,917             - Terrace House               1
3             23,228             -     Apartment               1
4             29,200             -     Apartment               1
5             26,405             -   Condominium               1
                  Tenure Completion.Date Purchaser.Address.Indicator
1 99 yrs from 28/08/2020     Uncompleted                     Private
2 99 yrs from 09/12/2019     Uncompleted                         N.A
3 99 yrs from 05/01/2022     Uncompleted                         HDB
4               Freehold     Uncompleted                         N.A
5 99 yrs from 12/09/2022     Uncompleted                     Private
  Postal.Code Postal.District Postal.Sector   Planning.Region Planning.Area
1      169878               3            16    Central Region        Outram
2      807233              28            80 North East Region     Serangoon
3      469657              16            46       East Region         Bedok
4      118992               5            11    Central Region    Queenstown
5      598444              21            59    Central Region   Bukit Timah

2.3.1 Union Dataset

This code checks if the column names of the five datasets are consistent.

all_colnames <- c(names(files1), names(files2), names(files3), names(files4), names(files5))
if(length(unique(all_colnames)) != length(names(files1))) {
  stop("Column names are different, files need adjustment to have consistent column names.")
}

This code checks if the data types of the five datasets are consistent.

for(col in names(files1)) {
  if(!all(sapply(list(files1, files2, files3, files4, files5), function(x) class(x[[col]]) == class(files1[[col]])))) {
    stop("Data types are different, files need adjustment to have consistent data types.")
  }
}

The five files can be merged successfully. They have consistent column names, data types, and equal row counts, making them compatible for union operation.

uniondata <- rbind(files1, files2, files3, files4, files5)

2.3.2 Check Dataset

Checking the structure of data using str():

str(uniondata)
'data.frame':   26806 obs. of  21 variables:
 $ Project.Name               : chr  "THE REEF AT KING'S DOCK" "URBAN TREASURES" "NORTH GAIA" "NORTH GAIA" ...
 $ Transacted.Price....       : chr  "2,317,000" "1,823,500" "1,421,112" "1,258,112" ...
 $ Area..SQFT.                : chr  "882.65" "882.65" "1,076.4" "1,033.34" ...
 $ Unit.Price....PSF.         : chr  "2,625" "2,066" "1,320" "1,218" ...
 $ Sale.Date                  : chr  "01 Jan 2023" "02 Jan 2023" "02 Jan 2023" "02 Jan 2023" ...
 $ Address                    : chr  "12 HARBOURFRONT AVENUE #05-32" "205 JALAN EUNOS #08-02" "29 YISHUN CLOSE #08-10" "45 YISHUN CLOSE #07-42" ...
 $ Type.of.Sale               : chr  "New Sale" "New Sale" "New Sale" "New Sale" ...
 $ Type.of.Area               : chr  "Strata" "Strata" "Strata" "Strata" ...
 $ Area..SQM.                 : chr  "82" "82" "100" "96" ...
 $ Unit.Price....PSM.         : chr  "28,256" "22,238" "14,211" "13,105" ...
 $ Nett.Price...              : chr  "-" "-" "-" "-" ...
 $ Property.Type              : chr  "Condominium" "Condominium" "Executive Condominium" "Executive Condominium" ...
 $ Number.of.Units            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Tenure                     : chr  "99 yrs from 12/01/2021" "Freehold" "99 yrs from 15/02/2021" "99 yrs from 15/02/2021" ...
 $ Completion.Date            : chr  "Uncompleted" "Uncompleted" "Uncompleted" "Uncompleted" ...
 $ Purchaser.Address.Indicator: chr  "HDB" "Private" "HDB" "HDB" ...
 $ Postal.Code                : int  97996 419535 269343 269294 797391 548742 275780 423510 126751 738086 ...
 $ Postal.District            : int  4 14 27 27 28 19 10 15 5 25 ...
 $ Postal.Sector              : int  9 41 26 26 79 54 27 42 12 73 ...
 $ Planning.Region            : chr  "Central Region" "East Region" "North Region" "North Region" ...
 $ Planning.Area              : chr  "Bukit Merah" "Bedok" "Yishun" "Yishun" ...

using is.na we see there are no missing values in the data

for(column_name in names(uniondata)){
na_count<- sum(is.na(uniondata[[column_name]]))
}
na_count
[1] 0

Using the duplicated function, we see that there are no duplicate entries in the data.

uniondata[duplicated(uniondata),]
 [1] Project.Name                Transacted.Price....       
 [3] Area..SQFT.                 Unit.Price....PSF.         
 [5] Sale.Date                   Address                    
 [7] Type.of.Sale                Type.of.Area               
 [9] Area..SQM.                  Unit.Price....PSM.         
[11] Nett.Price...               Property.Type              
[13] Number.of.Units             Tenure                     
[15] Completion.Date             Purchaser.Address.Indicator
[17] Postal.Code                 Postal.District            
[19] Postal.Sector               Planning.Region            
[21] Planning.Area              
<0 rows> (or 0-length row.names)

2.3.3 data set description

Field Description
Project.Name The name of the housing project or development.
Transacted.Price The price at which the property was transacted.
Area (SQFT) The area of the property in square feet.
Unit.Price (PSF) The price per square foot of the property.
Sale.Date The date on which the property was sold.
Address The address of the property.
Type.of.Sale The type of sale, e.g., resale, new sale, sub sale.
Type.of.Area The type of area, e.g., built-up area, land area.
Area (SQM) The area of the property in square meters. 1SQM=10.764SQFT
Unit.Price (PSM) The price per square meter of the property.
Nett.Price The final price after deductions or adjustments.
Property.Type The type of property, e.g., condominium, landed house, HDB flat.
Number.of.Units The number of units in the housing development.
Tenure The land tenure of the property, e.g., freehold, leasehold.
Completion.Date The date when the property construction was completed.
Purchaser.Address.Indicator It refers to the type of residence (HDB flat or private property) of the purchaser’s address as shown in the caveat, and is not indicative of whether the place of residence is owned by the purchaser. Where the information is not available, ‘N.A’ is indicated.
Postal.Code The postal code of the property.
Postal.District The postal district of the property.
Postal.Sector The postal sector of the property.
Planning.Region Singapore is divided into 5 areas called planning regions to facilitate the planning of the use and development of land for the whole of Singapore. The 5 regions are Central Region, East Region, North East Region, North Region and West Region. Each Planning Region consists of several planning areas which are further divided into sub-zones. A Development Guide Plan (DGP) is prepared for each planning area (see Development Guide Plan).
Planning.Area The planning area where the property is located.

2.3.4 Exploration of Dataset Features

The following code snippet is used to explore categorical data features, including counting and identifying unique values for different features. This will help us better understand the structure and content of the dataset, providing a foundation for further exploration and analysis.

code
library(dplyr)

# Convert Type.of.Sale
unique_type_of_sale <- unique(uniondata$Type.of.Sale)
table_type_of_sale <- table(uniondata$Type.of.Sale)
df_type_of_sale <- data.frame(Type_of_Sale = unique_type_of_sale, Count = as.vector(table_type_of_sale))

# Convert Type.of.Area
unique_type_of_area <- unique(uniondata$Type.of.Area)
table_type_of_area <- table(uniondata$Type.of.Area)
df_type_of_area <- data.frame(Type_of_Area = unique_type_of_area, Count = as.vector(table_type_of_area))

# Convert Property.Type
unique_property_type <- unique(uniondata$Property.Type)
table_property_type <- table(uniondata$Property.Type)
df_property_type <- data.frame(Property_Type = unique_property_type, Count = as.vector(table_property_type))

# Convert Number.of.Units
unique_number_of_units <- unique(uniondata$Number.of.Units)
table_number_of_units <- table(uniondata$Number.of.Units)
df_number_of_units <- data.frame(Number_of_Units = unique_number_of_units, Count = as.vector(table_number_of_units))

# Convert Completion.Date
unique_completion_date <- unique(uniondata$Completion.Date)
table_completion_date <- table(uniondata$Completion.Date)
df_completion_date <- data.frame(Completion_Date = unique_completion_date, Count = as.vector(table_completion_date))

# Convert Purchaser.Address.Indicator
unique_purchaser_address_indicator <- unique(uniondata$Purchaser.Address.Indicator)
table_purchaser_address_indicator <- table(uniondata$Purchaser.Address.Indicator)
df_purchaser_address_indicator <- data.frame(Purchaser_Address_Indicator = unique_purchaser_address_indicator, Count = as.vector(table_purchaser_address_indicator))

# Convert Postal.District
unique_postal_district <- unique(uniondata$Postal.District)
table_postal_district <- table(uniondata$Postal.District)
df_postal_district <- data.frame(Postal_District = unique_postal_district, Count = as.vector(table_postal_district))

# Convert Postal.Sector
unique_postal_sector <- unique(uniondata$Postal.Sector)
table_postal_sector <- table(uniondata$Postal.Sector)
df_postal_sector <- data.frame(Postal_Sector = unique_postal_sector, Count = as.vector(table_postal_sector))

# Convert Planning.Region
unique_planning_region <- unique(uniondata$Planning.Region)
table_planning_region <- table(uniondata$Planning.Region)
df_planning_region <- data.frame(Planning_Region = unique_planning_region, Count = as.vector(table_planning_region))

# Convert Planning.Area
unique_planning_area <- unique(uniondata$Planning.Area)
table_planning_area <- table(uniondata$Planning.Area)
df_planning_area <- data.frame(Planning_Area = unique_planning_area, Count = as.vector(table_planning_area))

# Convert Planning.Area
unique_Nett.Price <- unique(uniondata$Nett.Price)
table_Nett.Price <- table(uniondata$Nett.Price)
df_Nett.Price <- data.frame(Nett.Price = unique_Nett.Price, Count = as.vector(table_Nett.Price))
df_type_of_sale
  Type_of_Sale Count
1     New Sale  8644
2       Resale 16617
3     Sub Sale  1545
df_type_of_area
  Type_of_Area Count
1       Strata  1570
2         Land 25236
df_property_type
          Property_Type Count
1           Condominium 10772
2 Executive Condominium 10643
3         Terrace House   233
4   Semi-Detached House  3534
5             Apartment   524
6        Detached House  1100
df_number_of_units
  Number_of_Units Count
1               1 26794
2              43     5
3               4     1
4              60     2
5               3     1
6               2     1
7              11     1
8               7     1
df_completion_date
   Completion_Date Count
1      Uncompleted   682
2             2022     1
3                -     2
4             2015     1
5             1941     1
6             2000     1
7             1958     3
8             2014     1
9             2011     5
10            2009     8
11            2008     5
12            1998     4
13            2017     6
14            2018     1
15            1997     8
16            1990     3
17            2019     2
18            2010     5
19            2003     4
20            2016     4
21            1999     4
22            2001    12
23            1995    10
24            2007     4
25            1994     5
26            2012     2
27            1991    13
28            1976    11
29            2006    10
30            2005    52
31            1996    56
32            2002    42
33            2013    70
34            1967    30
35            2021    37
36            2004    51
37            1980    26
38            1986    60
39            1983    55
40            1977    85
41            1993    95
42            1988    20
43            1978    20
44            2020    40
45            1975    58
46            1992    48
47            2023    85
48            1982    84
49            1989   193
50            1981   148
51            1979   236
52            1954   433
53            1985   424
54            1984   461
55            1973   485
56            1968   192
57            1972   245
58            1964   265
59            1955   419
60            1961   355
61            1970   196
62            1956   226
63            1957   367
64            1969   376
65            1952   398
66            1987   522
67            1974   419
68            1963   702
69            1960  1247
70            1962  1172
71            1931  1440
72            0216  1293
73            1965  1287
74            1966   457
75            1951   173
76            2024   344
77            1971   712
78            1953  1032
79            1940     3
80            1959  8752
df_purchaser_address_indicator
  Purchaser_Address_Indicator Count
1                         HDB  8719
2                     Private  2733
3                         N.A 15354
df_postal_district
   Postal_District Count
1                4   181
2               14   295
3               27   875
4               28   338
5               19  1195
6               10     9
7               15   303
8                5   277
9               25  1003
10              18  1682
11               8   826
12               9   548
13              11   567
14              16   940
15               1  2763
16              20   998
17              22   357
18               3  1480
19              23  2920
20               7   696
21              21  1449
22              17   899
23              12  2662
24              13    18
25               6   322
26               2  1365
27              24  1270
28              26   568
df_postal_sector
   Postal_Sector Count
1              9   112
2             41     9
3             26    60
4             79   219
5             54    76
6             27   286
7             42    52
8             12   267
9             73   593
10            43   335
11            53   248
12            52   241
13            45   386
14            21     9
15            22   244
16            30    59
17            82   115
18            76   162
19            46   584
20             1   419
21            48   390
22            57   415
23            64   421
24            51   458
25            14   249
26            67    66
27            75   511
28            66    91
29            25   381
30            18    76
31            55    97
32            59   347
33            47   123
34            50   246
35            32   269
36            65   152
37            13   273
38            68   459
39            39  1962
40            24   155
41            23   187
42            35   704
43            58   157
44            17   137
45            36     3
46            56   354
47            44   478
48            15  1002
49            38  1205
50            40   827
51            16   435
52            28   139
53             7   557
54            20   879
55            80   570
56            33   423
57            69   141
58            29     3
59            19   332
60             8   861
61            61   413
62            34   817
63             6   571
64            31    18
65            60   322
66            11   212
67            78  1056
68            77    17
69            10  1348
70             5   324
71            63   244
72            49   453
df_planning_region
    Planning_Region Count
1    Central Region 12200
2       East Region  3644
3      North Region  5037
4 North East Region  1625
5       West Region  4300
df_planning_area
      Planning_Area Count
1       Bukit Merah  1558
2             Bedok  1753
3            Yishun   460
4          Sengkang  1314
5           Hougang   610
6       Bukit Timah   828
7     Marine Parade  2344
8          Clementi     3
9         Woodlands   594
10        Serangoon   647
11          Tanglin   587
12         Tampines  1227
13          Kallang  1209
14           Rochor   531
15           Novena   368
16          Punggol   681
17        Sembawang    25
18    Downtown Core  1551
19           Bishan    65
20      Jurong West   348
21        Pasir Ris   931
22       Queenstown   102
23    Bukit Panjang   251
24      Bukit Batok   832
25           Museum     1
26           Newton   453
27 Southern Islands   945
28        Toa Payoh   374
29    Choa Chu Kang   209
30          Geylang   502
31     River Valley   836
32          Orchard   981
33  Singapore River   140
34           Outram   106
35           Tengah     2
36       Ang Mo Kio  1055
37      Jurong East   612
38           Mandai    18
39     Sungei Kadut   657
40         Changi     320
41       Paya Lebar   776
df_Nett.Price
   Nett.Price Count
1           - 26770
2   2,129,200     1
3   2,018,062     1
4   2,224,300     1
5   1,752,240     1
6   2,289,600     1
7   2,307,600     1
8   2,325,600     1
9   2,397,600     1
10  2,433,600     1
11  2,451,600     1
12  3,960,900     1
13  1,506,600     1
14  3,987,900     1
15  2,335,888     1
16  2,343,600     1
17  3,697,000     1
18  3,180,900     1
19  1,941,900     1
20  2,137,192     1
21  3,050,000     1
22  3,445,100     1
23  1,935,070     1
24  1,606,400     1
25 11,712,000     1
26 14,212,000     1
27  2,040,900     1
28 14,463,000     1
29 14,358,000     1
30  1,950,000     1
31 14,124,000     1
32 12,203,000     1
33  1,990,000     1
34  1,558,579     1
35 11,795,000     1
36  1,628,000     1
37  1,492,600     1

Before diving into detailed data exploration, it’s essential to understand the key numerical variables in our dataset. The summary() function provides a concise summary of these variables, including measures of central tendency, dispersion, and distribution.

summary(uniondata$Transacted.Price....)
   Length     Class      Mode 
    26806 character character 
summary(uniondata$Unit.Price....PSF.)
   Length     Class      Mode 
    26806 character character 
summary(uniondata$Area..SQFT.)
   Length     Class      Mode 
    26806 character character 
Note

Summary of the Dateset findings:

  1. Upon examining the data’s structure, missing values, and duplicates, it appears that the data is tidy and does not require further processing.

  2. The information contained in the “Project.Name” and “Address” columns appears overly intricate and does not significantly contribute to the analysis. Therefore, these columns can be safely removed.

  3. Both “SQM” and “SQFT” serve as units of measurement for area. It suffices to retain one of them for analysis purposes.

  4. The “Tenure” information can be simplified by retaining only the duration in years and removing the specific date portion, facilitating clearer analysis.

  5. Given considerations for visualization, the “Postal” information is redundant. It is suggested to eliminate the “Postal.Sector” column.

  6. Anomalies such as the presence of values like 60 and 43 in the “Number of Units” column warrant further investigation to ascertain whether they are indeed outliers or erroneous entries.

  7. Upon inspecting the “Nett.Price” column, it is observed that only 27 rows contain specific numerical values, while the rest are empty. Hence, this feature proves ineffective and can be safely discarded.

This concise summary encapsulates the key findings of the EDA process, laying the groundwork for further analysis and decision-making.

3 Data preparation

3.1 Recast variable

Transacted.Price….,Transacted.Price….,Area..SQFT. are of type chr,As they should be numeric, they will be cast as num class.

# Remove commas and convert to numeric
uniondata$Transacted.Price.... <- as.numeric(gsub(",", "", uniondata$Transacted.Price....))
uniondata$Area..SQFT. <- as.numeric(gsub(",", "", uniondata$Area..SQFT.))
uniondata$Unit.Price....PSF. <- as.numeric(gsub(",", "", uniondata$Unit.Price....PSF.))

3.2 Create new variables

From the original Tenure column, extracting the numeric part, removing non-numeric characters, and classifying based on the extracted numeric values.

uniondata <- uniondata %>%
  mutate(
    Tenure_cleaned = gsub("[^0-9]", "", Tenure),
    Tenure_numeric = as.numeric(substr(Tenure_cleaned, 1, 3)),
    Tenure_level = case_when(
      grepl("Freehold", Tenure) ~ "Freehold",
      Tenure_numeric < 200 ~ "< 200",
      TRUE ~ "> 200"
    )
  )
unique(uniondata$Tenure_level)
[1] "> 200"    "Freehold" "< 200"   

3.3 Segmenting Variables into Groups

Creates boxplots with summary statistics, aiming to understand the distribution and characteristics of these variables visually.

filter_area <- uniondata %>% 
  filter(Area..SQFT. < 5000)

library(ggplot2)
library(patchwork)

# Calculate summary statistics for Area..SQFT.
mean_val <- mean(filter_area$`Area..SQFT.`)
median_val <- median(filter_area$`Area..SQFT.`)
q1_val <- quantile(filter_area$`Area..SQFT.`, 0.25)
q3_val <- quantile(filter_area$`Area..SQFT.`, 0.75)

# Create boxplot for Area..SQFT.
p1 <- ggplot(filter_area, aes(x = "", y = `Area..SQFT.`)) +
  geom_boxplot(fill = "#1a80bb") +  # Fill boxplot with color
  stat_summary(fun = "mean", geom = "point", shape = 8,
               size = 2, color = "#fefefe") +
  annotate("text", x = 0.8, y = max(filter_area$`Area..SQFT.`), 
           label = paste("Mean:", round(mean_val, 2), "\n",
                         "Median:", round(median_val, 2), "\n",
                         "Q1:", round(q1_val, 2), "\n",
                         "Q3:", round(q3_val, 2)), 
           vjust = 1, hjust = 0.5, size = 3) +
  labs(y = "Area (SQFT)", x="",title = "Boxplot of Area (SQFT)") +
  theme_minimal()+
  theme( text = element_text(size = 8))

# Calculate summary statistics for Unit.Price....PSF.
mean_val_psf <- mean(uniondata$`Unit.Price....PSF.`)
median_val_psf <- median(uniondata$`Unit.Price....PSF.`)
q1_val_psf <- quantile(uniondata$`Unit.Price....PSF.`, 0.25)
q3_val_psf <- quantile(uniondata$`Unit.Price....PSF.`, 0.75)

# Create boxplot for Unit.Price....PSF.
p2 <- ggplot(uniondata, aes(x = "", y = `Unit.Price....PSF.`)) +
  geom_boxplot(fill = "#1a80bb") +  # Fill boxplot with color
  stat_summary(fun = "mean", geom = "point", shape = 8,
               size = 2, color = "#fefefe") +
  annotate("text", x = 0.8, y = max(uniondata$`Unit.Price....PSF.`), 
           label = paste("Mean:", round(mean_val_psf, 2), "\n",
                         "Median:", round(median_val_psf, 2), "\n",
                         "Q1:", round(q1_val_psf, 2), "\n",
                         "Q3:", round(q3_val_psf, 2)), 
           vjust = 1, hjust = 0.5, size = 3) +
  labs(y = "Price (PSF)", x="", title = "Boxplot of Price (PSF)") +
  theme_minimal()+
  theme( text = element_text(size = 8))

# Combine the plots
(p1 + p2)

Note

Based on the boxplot results, the Area and Price need to be grouped.

They are divided into four groups: Small, Medium, Large, and plus, based on the first quartile (Q1), median, third quartile (Q3), and outliers (values beyond 1.5 times the interquartile range, IQR).

These groupings help categorize the data into different size categories, providing insights into the distribution and variability of the Area and Price variables.

# Grouping the Area variable
uniondata$Area_Group <- cut(uniondata$`Area..SQFT.`, 
                            breaks = c(-Inf, quantile(uniondata$`Area..SQFT.`, 0.25),
                                       quantile(uniondata$`Area..SQFT.`, 0.5),
                                       quantile(uniondata$`Area..SQFT.`, 0.75),
                                       Inf),
                            labels = c("Small", "Medium", "Large", "Plus"))

# Grouping the Price variable
uniondata$Price_Group <- cut(uniondata$`Unit.Price....PSF.`, 
                             breaks = c(-Inf, quantile(uniondata$`Unit.Price....PSF.`, 0.25),
                                        quantile(uniondata$`Unit.Price....PSF.`, 0.5),
                                        quantile(uniondata$`Unit.Price....PSF.`, 0.75),
                                        Inf),
                             labels = c("Low", "Medium-Low", "Medium-High", "High"))

3.4 Deciding on Plausible Values to Use

Through exploring the dataset in the previous step, removing irrelevant or redundant columns can facilitate the subsequent analysis. Additionally, the analysis of the dataset is limited to the first quarter of 2024.

uniondata$Sale.Date <- dmy(uniondata$Sale.Date)
cols_to_exclude <- c("Project.Name", "Nett.Price...", "Address", "Area..SQM.","Unit.Price....PSM.","Postal.Code","Postal.Sector","Tenure_cleaned","Tenure_numeric","Tenure")

data2024 <- uniondata %>%
  select(-any_of(cols_to_exclude))
head(data2024)
  Transacted.Price.... Area..SQFT. Unit.Price....PSF.  Sale.Date Type.of.Sale
1              2317000      882.65               2625 2023-01-01     New Sale
2              1823500      882.65               2066 2023-01-02     New Sale
3              1421112     1076.40               1320 2023-01-02     New Sale
4              1258112     1033.34               1218 2023-01-02     New Sale
5              1280000      871.88               1468 2023-01-03       Resale
6              5870000     3322.85               1767 2023-01-03       Resale
  Type.of.Area         Property.Type Number.of.Units Completion.Date
1       Strata           Condominium               1     Uncompleted
2       Strata           Condominium               1     Uncompleted
3       Strata Executive Condominium               1     Uncompleted
4       Strata Executive Condominium               1     Uncompleted
5       Strata           Condominium               1            2022
6         Land         Terrace House               1               -
  Purchaser.Address.Indicator Postal.District   Planning.Region Planning.Area
1                         HDB               4    Central Region   Bukit Merah
2                     Private              14       East Region         Bedok
3                         HDB              27      North Region        Yishun
4                         HDB              27      North Region        Yishun
5                         HDB              28 North East Region      Sengkang
6                     Private              19 North East Region       Hougang
  Tenure_level Area_Group Price_Group
1        > 200     Medium        High
2     Freehold     Medium Medium-High
3        > 200      Large         Low
4        > 200      Large         Low
5        > 200     Medium  Medium-Low
6        > 200       Plus Medium-High
str(data2024
    )
'data.frame':   26806 obs. of  16 variables:
 $ Transacted.Price....       : num  2317000 1823500 1421112 1258112 1280000 ...
 $ Area..SQFT.                : num  883 883 1076 1033 872 ...
 $ Unit.Price....PSF.         : num  2625 2066 1320 1218 1468 ...
 $ Sale.Date                  : Date, format: "2023-01-01" "2023-01-02" ...
 $ Type.of.Sale               : chr  "New Sale" "New Sale" "New Sale" "New Sale" ...
 $ Type.of.Area               : chr  "Strata" "Strata" "Strata" "Strata" ...
 $ Property.Type              : chr  "Condominium" "Condominium" "Executive Condominium" "Executive Condominium" ...
 $ Number.of.Units            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Completion.Date            : chr  "Uncompleted" "Uncompleted" "Uncompleted" "Uncompleted" ...
 $ Purchaser.Address.Indicator: chr  "HDB" "Private" "HDB" "HDB" ...
 $ Postal.District            : int  4 14 27 27 28 19 10 15 5 25 ...
 $ Planning.Region            : chr  "Central Region" "East Region" "North Region" "North Region" ...
 $ Planning.Area              : chr  "Bukit Merah" "Bedok" "Yishun" "Yishun" ...
 $ Tenure_level               : chr  "> 200" "Freehold" "> 200" "> 200" ...
 $ Area_Group                 : Factor w/ 4 levels "Small","Medium",..: 2 2 3 3 2 4 4 4 1 4 ...
 $ Price_Group                : Factor w/ 4 levels "Low","Medium-Low",..: 4 3 1 1 2 3 1 3 3 1 ...

4 Exploratory Data Analysis

4.1 EDA1 Real Estate Market Trend Analysis

`summarise()` has grouped output by 'ym'. You can override using the `.groups`
argument.
`summarise()` has grouped output by 'Date'. You can override using the
`.groups` argument.
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.

library(tidyverse)

# Group data by month and year, calculate average house price
df_avg <- uniondata %>%
  filter(Purchaser.Address.Indicator != "N.A") %>%
  mutate(ym = format(Sale.Date, "%Y-%m")) %>%
  group_by(ym, Purchaser.Address.Indicator) %>%
  summarise(avg_price = mean(Transacted.Price....))

# Plot 1
plot1 <- ggplot(df_avg, aes(x = ym, y = avg_price, fill = Purchaser.Address.Indicator)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(values =  c("#082a54", "#f0c571"), name = "Property type") +
  geom_hline(yintercept = mean(df_avg$avg_price), color = "red", linetype = "dashed") +  
  geom_text(aes(x = 1, y = mean(df_avg$avg_price), 
                label = paste("Avg Price:", round(mean(df_avg$avg_price), 2))), 
                color = "red", size = 3, hjust = -0.2, vjust = -0.5) + 
  labs(x = "Month", y = "Price(PSF)", title = "Price (PSF) of HDB by Property type and Month") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none",
        text = element_text(size = 8))


# Count sales by month and property type
df_count <- uniondata %>%
  mutate(Date = format(Sale.Date, "%Y-%m")) %>%
  filter(Purchaser.Address.Indicator != "N.A") %>%
  group_by(Date, Purchaser.Address.Indicator) %>%
  summarise(Sales = n()) %>%
  group_by(Date) %>%
  mutate(Percentage = Sales / sum(Sales))

# Plot 2
plot2 <- ggplot(df_count, aes(x = Date, y = Sales, fill = Purchaser.Address.Indicator)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = scales::percent(Percentage, accuracy = 1)), 
            position = position_stack(vjust = 0.5), 
            size = 3, 
            color = "#4A5859") + 
  scale_fill_manual(values =  c("#1a80bb", "#f2c45f"), name = "Property type") +
  labs(x = "Month", y = "Sales", title = "Monthly Sales by Property type") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = c(0.8, 0.9), legend.title = element_blank(), 
        legend.key.size = unit(0.4, 'cm'), legend.key.height = unit(0.4, 'cm'), 
        legend.key.width = unit(0.4, 'cm'), text = element_text(size = 8)) 

# Combine plots
  plot1 + plot2
Note

Based on the provided charts, we ca analyze the recent price trends and sales situation in different types of properties as follow:

Price Trends:

  • Prices for public housing (HDB) have remained stable, averaging between 1700 to 1800 SGD per square foot monthly. In contrast, private real estate prices fluctuate more and are consistently higher than HDB prices.

  • The long-term average price for HDB units is lower than the overall average, reflecting its affordability and welfare nature.

  • Prices in the first quarter of 2024 have decreased compared to 2023, especially for private real estate.

Sales Situation:

  • Monthly sales volumes for both HDB and private real estate maintain a ratio of around 60% to 40%, showing stable market demand.

  • The surge in HDB sales volume in 2023 has returned to normal levels in the first quarter of 2024.

  • Despite fluctuations, HDB sales consistently account for around 60% of total sales, highlighting its importance in meeting housing needs.

4.2 EDA2 Price Distribution of Housing in Five Regions of Singapore

$x
[1] "Region"

$y
[1] "Price(PSF)"

attr(,"class")
[1] "labels"
Picking joint bandwidth of 60.4

library(ggplot2)
library(ggridges)

# Calculate median values
median_values <- data2024 %>%
  group_by(Planning.Region) %>%
  summarise(median_value = median(Unit.Price....PSF.))

# Recode Planning.Region
data2024 <- data2024 %>%
  mutate(Planning.Region = recode(Planning.Region, 
                                  "Central Region" = "Central",
                                  "East Region" = "East",
                                  "North Region" = "North",
                                  "North East Region" = "North East",
                                  "West Region" = "West"
                                  ))

# Plot 1
Plot1 <- ggplot(data2024, aes(Unit.Price....PSF., Planning.Region)) +
  geom_density_ridges(aes(fill = Planning.Region), alpha = 0.2, quantile_lines = TRUE, quantile_fun = function(x, ...) median(x)) +
  geom_text(data = median_values, aes(x = median_value - 200, y = Planning.Region, label = paste("Median Value:", round(median_value, 2))),
            hjust = -0.2, vjust = -0.5, color = "black", size = 3) +
  theme_minimal() +
  theme(legend.position = "bottom",
        text = element_text(size = 8),
        plot.title = element_text(size = 12, face = "bold"),
        legend.title = element_blank(),
        legend.key.size = unit(0.4, 'cm'), 
        legend.key.height = unit(0.4, 'cm'), 
        legend.key.width = unit(0.4, 'cm')) +
  labs(x = "Price(PSF)", y = "Region", title = "Unit Price Distribution by Region")

# Plot 2
Plot2 <- data2024 %>% 
  filter(Purchaser.Address.Indicator != "N.A") %>% 
  ggplot(aes(Planning.Region, Unit.Price....PSF., fill = Purchaser.Address.Indicator)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom",
        text = element_text(size = 8),
        legend.title = element_blank()) +
  labs(x = "Region", y = "Price(PSF)")

# Combine plots
Plot1 + Plot2
Note

Based on the chart, we can analyze the price distribution of housing in the five regions of Singapore and different types of properties as follows:

  • The median price in the Central region is the highest at $2,224 per square foot, while the West region has the lowest median price at $1,529 per square foot. This may reflect the high-end positioning and high demand for properties in the city center.

  • The price distribution follows a normal distribution curve, but the peak and spread of each region’s distribution vary. The curves for the Central and West regions are steeper, indicating that prices are more concentrated in these areas, while those for the Northeast, North, and East regions are flatter, indicating greater price variation.

  • The box plots show significant price variations within each region, especially in the Northeast and West regions, where the price dispersion is the highest.

  • Private properties generally have higher prices than public housing (HDB), especially in the West and North regions, where the price gap between the two types of properties is significant. This aligns with the independent ownership and higher positioning of private properties.

  • Even within the same region, there are significant price differences between private properties and HDB flats, reflecting the diverse nature of the Singapore real estate market.

4.3 EDA3 Comparison of Unit Price and Area for Different Property Types in Singapore

`summarise()` has grouped output by 'Purchaser.Address.Indicator'. You can
override using the `.groups` argument.
Attaching package: 'scales'
The following object is masked from 'package:purrr':

discard
The following object is masked from 'package:readr':

col_factor

df_Typeprice <- uniondata %>%
  filter(Purchaser.Address.Indicator !="N.A") %>% 
  group_by(Purchaser.Address.Indicator, Property.Type) %>%
  summarise(Price = median(Unit.Price....PSF.),Area=median( Area..SQFT.))
# Loading required packages

library(tidyverse)
library(scales)
library(ggrepel)

# Plot
# Plot
ggplot(df_Typeprice, aes(x = Area, y = Price)) +
  geom_point(aes(size = Price, color = Purchaser.Address.Indicator), alpha = 0.6) +
  geom_text_repel(aes(label = Property.Type), size = 2.5) +
  scale_color_manual(values = c("#C83E4D", "#4A5859"), name = "Property Type") +  # Change legend title
  labs(title = "Comparison of Area and Unit Price by Property Type", x = "Area", y = "Price (PSF)") +  # Change axis labels
  theme(text = element_text(size = 10), 
        plot.title = element_text(size = 12, face = "bold"),
        legend.key.size = unit(0.5, 'cm'), 
        legend.key.height = unit(0.5, 'cm'), 
        legend.key.width = unit(0.5, 'cm'),
        legend.title = element_text(size = 8))  
Note

Based on the chart data, we can can analyze the unit area prices and sizes for different types of properties as follows:

  1. Private properties have higher prices than public housing (HDB), reflecting their nature and market positioning.

  2. There are significant price differences among different types of private properties, with detached houses being the highest priced and condominiums being relatively lower.

  3. Public housing (HDB) shows less variation in prices across different types, reflecting uniform price control and affordability.

  4. Detached and semi-detached houses generally have larger sizes, while apartments and condominiums have smaller sizes.

  5. Executive condominiums have the lowest unit area prices among different types of apartments, likely due to their mixed characteristics of private apartments and affordable housing.

4.4 EDA4 Regional Deviation in Unit Prices Across Singapore

library(ggplot2)

# Private
df_Aavg <- uniondata %>%
  filter(Purchaser.Address.Indicator == "Private") %>%
  group_by(Planning.Area) %>%
  summarise(avg_Aprice = mean(Unit.Price....PSF.))

df_Aavg$p_z <- round((df_Aavg$avg_Aprice - mean(df_Aavg$avg_Aprice)) / sd(df_Aavg$avg_Aprice), 2)
df_Aavg$p_ztype <- ifelse(df_Aavg$p_z < 0, "below", "above")
df_Aavg <- df_Aavg[order(df_Aavg$p_z), ]

df_Aavg$Planning.Area <- factor(df_Aavg$Planning.Area, levels = df_Aavg$Planning.Area)

plot11 <- ggplot(df_Aavg, aes(x = Planning.Area, y = p_z, label = p_z)) +
  geom_bar(stat = "identity", aes(fill = p_ztype), position = position_dodge2(width = 2), width = 0.8) +
  scale_fill_manual(name = "Average Price", labels = c("Above Average", "Below Average"), values = c("below" = "#C83E4D", "above" = "#4A5859")) +
  labs(title = "Unit Price Deviations by Area", y = "", subtitle = "Private") +
  coord_flip() +
  theme(legend.position = "None", text = element_text(size = 8), plot.title = element_text(size = 12, face = "bold"))

# HDB
df_AHavg <- uniondata %>%
  filter(Purchaser.Address.Indicator == "HDB") %>%
  group_by(Planning.Area) %>%
  summarise(avg_Aprice = mean(Unit.Price....PSF.))

df_AHavg$p_z <- round((df_AHavg$avg_Aprice - mean(df_AHavg$avg_Aprice)) / sd(df_AHavg$avg_Aprice), 2)
df_AHavg$p_ztype <- ifelse(df_AHavg$p_z < 0, "below", "above")
df_AHavg <- df_AHavg[order(df_AHavg$p_z), ]

df_AHavg$Planning.Area <- factor(df_AHavg$Planning.Area, levels = df_AHavg$Planning.Area)

plot22 <- ggplot(df_AHavg, aes(x = Planning.Area, y = p_z, label = p_z)) +
  geom_bar(stat = "identity", aes(fill = p_ztype), position = position_dodge2(width = 2), width = 0.8) +
  scale_fill_manual(name = "Average Price", labels = c("Above Average", "Below Average"), values = c("below" = "#C83E4D", "above" = "#4A5859")) +
  labs(y = "", x = "", subtitle = "HDB") +
  coord_flip() +
  theme(text = element_text(size = 8),
        legend.title = element_blank(),
        legend.position = c(0.25, 0.9),
        legend.key.size = unit(0.4, 'cm'),
        legend.key.height = unit(0.4, 'cm'),
        legend.key.width = unit(0.4, 'cm'))

plot11 + plot22
Note

Based on this chart, we can analyze the deviation of unit housing prices in different regions of Singapore:

Private Properties:

  • Private property prices in Chinatown, Newton, and Orchard Road areas are significantly higher than the average, with the highest deviations. These areas have prime locations, which may be the primary reason.

  • Suburban areas such as Bukit Panjang, Serangoon, Balestier, and Bedok have private property prices below the average.

HDB Flats:

  • HDB flat prices in River Valley, Chinatown, and Newton areas are higher than the average, as these areas are close to the city center, which may be the main influencing factor.

  • Suburban areas such as Balestier, Serangoon, Bedok, and Choa Chu Kang have HDB flat prices generally below the average.

5 Conclusion